Distributed Submodular Cover: Succinctly Summarizing Massive Data

Authors

  • Baharan Mirzasoleiman
  • Amin Karbasi
  • Ashwinkumar Badanidiyuru
  • Andreas Krause
Abstract

How can one find a subset, ideally as small as possible, that represents a massive dataset well? That is, the subset's utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. In this paper, we formalize this challenge as a submodular cover problem. Here, the utility is assumed to exhibit submodularity, a natural diminishing-returns condition prevalent in many data summarization applications. The classical greedy algorithm is known to provide solutions with logarithmic approximation guarantees compared to the optimum solution. However, this sequential, centralized approach is impractical for truly large-scale problems. In this work, we develop the first distributed algorithm, DISCOVER, for submodular set cover that is easily implementable using MapReduce-style computations. We theoretically analyze our approach and present approximation guarantees for the solutions returned by DISCOVER. We also study a natural trade-off between the communication cost and the number of rounds required to obtain such a solution. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including active set selection, exemplar-based clustering, and vertex cover, on tens of millions of data points using Spark.
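To make the submodular cover setting concrete, here is a minimal Python sketch of the classical centralized greedy algorithm that the abstract mentions as the sequential baseline. It is not DISCOVER, whose details are not given on this page. The function names (`coverage`, `greedy_submodular_cover`), the toy data, and the choice of set coverage as the utility are assumptions made purely for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set


def coverage(sets: Dict[Hashable, Set[int]]) -> Callable[[Iterable[Hashable]], int]:
    """Build a toy utility: the number of items covered by the chosen sets.

    Set coverage is a standard example of a monotone submodular function.
    """
    def f(selected: Iterable[Hashable]) -> int:
        covered: Set[int] = set()
        for key in selected:
            covered |= sets[key]
        return len(covered)
    return f


def greedy_submodular_cover(
    ground: Iterable[Hashable],
    f: Callable[[Iterable[Hashable]], int],
    target: int,
) -> List[Hashable]:
    """Classical greedy: add the element with the largest marginal gain
    until the utility reaches `target` (illustrative, centralized version)."""
    selected: List[Hashable] = []
    remaining = set(ground)
    current = f(selected)
    while current < target and remaining:
        best, best_gain = None, 0
        # Sorted iteration gives deterministic tie-breaking.
        for element in sorted(remaining):
            gain = f(selected + [element]) - current
            if gain > best_gain:
                best, best_gain = element, gain
        if best is None:  # no element improves the utility any further
            break
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected


# Hypothetical toy instance: four candidate sets over the items 1..6.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
f = coverage(sets)
summary = greedy_submodular_cover(sets.keys(), f, target=f(sets.keys()))
print(summary)
```

On this toy instance the sketch selects two of the four sets and still reaches the utility of the full collection. Every iteration rescans all remaining elements and depends on the previous choice, which is the sequential, centralized behavior the abstract describes as impractical at scale.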


Similar resources

Fast Distributed Submodular Cover: Public-Private Data Summarization

In this paper, we introduce the public-private framework of data summarization motivated by privacy concerns in personalized recommender systems and online social services. Such systems have usually access to massive data generated by a large pool of users. A major fraction of the data is public and is visible to (and can be used for) all users. However, each user can also contribute some priva...


Combinatorial Problems with Discounted Price Functions in Multi-agent Systems

Motivated by economic thought, a recent research agenda has suggested the algorithmic study of combinatorial optimization problems under functions which satisfy the property of decreasing marginal cost. A natural first step to model such functions is to consider submodular functions. However, many fundamental problems have turned out to be extremely hard to approximate under general submodular ...


Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

Many large-scale machine learning problems (such as clustering, non-parametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable yet representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set; but for truly larg...


Optimal Approximation Algorithms for Multi-agent Combinatorial Problems with Discounted Price Functions

Submodular functions are an important class of functions in combinatorial optimization which satisfy the natural properties of decreasing marginal costs. The study of these functions has led to strong structural properties with applications in many areas. Recently, there has been significant interest in extending the theory of algorithms for optimizing combinatorial problems (such as network de...


The Power of Randomization: Distributed Submodular Maximization on Massive Datasets

A wide variety of problems in machine learning, including exemplar clustering, document summarization, and sensor placement, can be cast as constrained submodular maximization problems. Unfortunately, the resulting submodular optimization problems are often too large to be solved on a single machine. We develop a simple distributed algorithm that is embarrassingly parallel and it achieves prova...




Publication date: 2015